Large Vocabulary Continuous Speech Recognition for Estonian Using Morpheme Classes

Author

  • Tanel Alumäe
Abstract

This paper describes the development of a large-vocabulary, speaker-independent continuous speech recognition system for Estonian. Estonian is an agglutinative language and the number of distinct word forms is very large; in addition, the word order is relatively unconstrained. To achieve good language coverage, we use pseudo-morphemes as the basic units in a statistical trigram language model. To improve language model robustness, we automatically find morpheme classes and interpolate the morpheme model with the class-based model. The language model is trained on a newspaper corpus of 15 million word forms. Clustered triphones with multiple Gaussian mixture components are used for acoustic modeling. The system with the interpolated morpheme language model is found to perform significantly better than the baseline word-form trigram system in all areas. The word error rate of the best system is 27.3%, which is a 10.0% absolute improvement over the baseline system.
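The abstract does not say how the morpheme model and the class-based model are combined. A common choice is linear interpolation, and the sketch below assumes it, together with the standard class trigram factorization P(c_i | c_{i-2}, c_{i-1}) · P(m_i | c_i). The function names, the class labels, the weight `lam`, and all probabilities are hypothetical toy values, not figures from the paper.

```python
def class_trigram_prob(morph, history, morph2class, class_trigram, emission):
    """Class-based trigram probability of `morph` given a two-morpheme history.

    Assumed factorization: P(c | c1, c2) * P(morph | c), where c1, c2 and c
    are the classes of the two history morphemes and of `morph`.
    """
    c1, c2 = morph2class[history[0]], morph2class[history[1]]
    c = morph2class[morph]
    return class_trigram.get((c1, c2, c), 0.0) * emission.get((morph, c), 0.0)


def interpolated_prob(p_morph, p_class, lam):
    """Linear interpolation of two probability estimates with weight `lam`."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * p_morph + (1.0 - lam) * p_class


# Toy setup (all values invented for illustration only).
morph2class = {"maja": "STEM", "de": "PLURAL", "s": "CASE"}
class_trigram = {("STEM", "PLURAL", "CASE"): 0.4}  # P(CASE | STEM, PLURAL)
emission = {("s", "CASE"): 0.25}                   # P("s" | CASE)

p_class = class_trigram_prob("s", ("maja", "de"), morph2class,
                             class_trigram, emission)
p_morph = 0.02  # stand-in for a sparse morpheme-trigram estimate
p_mix = interpolated_prob(p_morph, p_class, lam=0.6)
print(p_mix)
```

In practice the interpolation weight would be tuned on held-out text rather than fixed by hand. The point of the mixture is that the class model can assign reasonable probability to morpheme sequences that the morpheme trigram has seen rarely or never.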


Similar resources

Large Vocabulary Continuous Speech Recognition for Estonian Using Morphemes and Classes

This paper describes the development of a large-vocabulary, speaker-independent continuous speech recognition system for Estonian. Estonian is an agglutinative language and the number of distinct word forms is very large; in addition, the word order is relatively unconstrained. To achieve good language coverage, we use pseudo-morphemes as the basic units in a statistical trigram language model. To im...


Lemmatized Latent Semantic Model for Language Model Adaptation of Highly Inflected Languages

We present a method to adapt statistical N-gram models for large vocabulary continuous speech recognition of highly inflected languages. The method combines morphological analysis, latent semantic analysis (LSA) and fast marginal adaptation for building topic-adapted trigram models, based on a background language model and very short adaptation texts. We compare words, lemmas and morphemes as b...


Korean large vocabulary continuous speech recognition using pseudomorpheme units

This paper presents a Korean large vocabulary continuous speech recognition system based on pseudomorpheme units. In Korean, an eojeol (word phrase) is a unit for spacing and a morpheme is the smallest unit with semantic meaning. If the eojeol is used as the dictionary and language modeling unit, the number of units becomes enormous. Instead, we propose to use modified morpheme or pseudomorph...


Limited-Vocabulary Estonian Continuous Speech Recognition System using Hidden Markov Models

The article presents a limited-vocabulary speaker independent continuous Estonian speech recognition system based on hidden Markov models. The system is trained using an annotated Estonian speech database of 60 speakers, approximately 4 hours in duration. Words are modelled using clustered triphones with multiple Gaussian mixture components. The system is evaluated using a number recognition ta...


On large vocabulary continuous speech recognition of highly inflectional language - Czech

A system for large vocabulary continuous speech recognition of a highly inflectional language is introduced. A word-based recognition approach is compared with a morpheme-based recognition system. An experiment involving Czech N-best rescoring has been performed with encouraging results.




Publication date: 1998